image description


Enhancing Large Vision Language Models with Self-Training on Image Comprehension

Neural Information Processing Systems

Large vision language models (LVLMs) integrate large language models (LLMs) with pre-trained vision encoders, thereby activating the model's perception capability so that it can understand image inputs for different queries and conduct subsequent reasoning. Improving this capability requires high-quality vision-language data, which is costly and labor-intensive to acquire. Self-training approaches have been effective in single-modal settings, alleviating the need for labeled data by leveraging the model's own generations. However, effective self-training remains a challenge for LVLMs because of their unique visual perception and reasoning capabilities.


Image Textualization: An Automatic Framework for Generating Rich and Detailed Image Descriptions

Neural Information Processing Systems

Image description datasets play a crucial role in advancing applications such as image understanding, text-to-image generation, and text-image retrieval. Currently, image description datasets primarily originate from two sources. One source is image-text pairs scraped from the web. Despite their abundance, these descriptions are often noisy and of low quality. The other source is human labeling.


DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception

Neural Information Processing Systems

Existing Multimodal Large Language Models (MLLMs) increasingly emphasize complex understanding of various visual elements, including multiple objects, text information, and spatial relations.